In this final lecture, we will look at a number of useful applications of the techniques discussed in previous weeks. These all come from the worlds of finance and economics, mostly because of the abundance of data. We will also be using Wes McKinney's template .ipynb file as a skeleton for the lectures.
Throughout, we will make use of some technical jargon. Here are frequent terms that you should be familiar with:
cross-section: data observed at a fixed point in time (for example, the closing prices of many tickers on a particular day).
time series: data for a single item observed over a range of points in time.
panel: data observed along both the item and the time dimension.
In [ ]:
from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
In [ ]:
%matplotlib inline
In previous lectures we have discussed a number of useful tools within the Python data analysis ecosystem for data munging. Here we will see a few of these tools in action.
Suppose you are given two different financial datasets on which you hope to perform a reasonable analysis. More likely than not, the series will have indices that don't line up perfectly, or (if the data is stored in a DataFrame) columns or row labels that don't match.
This is such a frequent and frustrating problem in data analysis that it has its own name: the data alignment problem. In Pandas, basic arithmetic operations between datasets perform automatic alignment. For example, let us consider the following datasets.
The stock_px.csv dataset can be downloaded here and the volume.csv dataset can be downloaded here.
In [ ]:
close_px = pd.read_csv('stock_px.csv', parse_dates=True, index_col=0)
volume = pd.read_csv('volume.csv', parse_dates=True, index_col=0)
prices = close_px.ix['2011-09-05':'2011-09-14', ['AAPL', 'JNJ', 'SPX', 'XOM']]
volume = volume.ix['2011-09-05':'2011-09-12', ['AAPL', 'JNJ', 'XOM']]
In [ ]:
prices
In [ ]:
volume
One useful metric in the financial analysis of stock prices is the volume-weighted average price, or VWAP, of a stock: the sum of price times volume over the window of interest, divided by the total volume, i.e. VWAP = sum(price_t * volume_t) / sum(volume_t). This metric weights the price of a stock at any given time by the volume of trades occurring at that time. The idea is that high-volume trades give a better indication of how institutional investors, who presumably have an expert understanding of the stock's value, perceive it.
Since Pandas aligns the data automatically and excludes missing data in functions like sum, we can express this concisely as:
In [ ]:
prices * volume
In [ ]:
vwap = (prices * volume).sum() / volume.sum()
In [ ]:
vwap
Because no volume data is given for SPX (the S&P 500 index), its VWAP value is consequently missing. We can remove it from the above series with a simple call to dropna.
In [ ]:
vwap.dropna()
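As a quick sanity check (a small sketch reusing the objects defined above), you can reproduce a single symbol's VWAP by hand and compare it against the aligned computation; this should match vwap['AAPL'], since the element-wise product aligns on the shared dates before summing.
In [ ]:
# Hand-computed VWAP for one symbol, for comparison with vwap['AAPL']
(prices['AAPL'] * volume['AAPL']).sum() / volume['AAPL'].sum()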
Manual alignment can be achieved with the align method built into DataFrame objects.
In [ ]:
prices.align(volume, join='inner')
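Note that align returns a pair of reindexed objects rather than a single result; a minimal sketch of unpacking and inspecting them:
In [ ]:
# align returns a tuple of (left, right) objects reindexed to the common labels
left, right = prices.align(volume, join='inner')
left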
You can also build DataFrame objects from Series objects that have different individual shapes without a hitch, thanks to Pandas' automatic alignment.
In [ ]:
s1 = Series(range(3), index=['a', 'b', 'c'])
s2 = Series(range(4), index=['d', 'b', 'c', 'e'])
s3 = Series(range(3), index=['f', 'a', 'c'])
DataFrame({'one': s1, 'two': s2, 'three': s3})
Again, NaN is assigned for any label that the corresponding original Series does not cover.
You can specify explicitly the index of the result, discarding the rest:
In [ ]:
DataFrame({'one': s1, 'two': s2, 'three': s3}, index=list('face'))
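As a rough equivalent (a small sketch, not a new feature), you could build the full DataFrame first and then use reindex to pick out the same rows:
In [ ]:
# Equivalent to passing index=list('face') at construction time
DataFrame({'one': s1, 'two': s2, 'three': s3}).reindex(list('face'))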
The Federal Reserve publishes new GDP data every quarter, but only publishes inflation data annually. Publicly listed firms are required to provide income and balance sheet data quarterly, but the reporting dates need not be the same for every company. These types of problems fall under the category of time series frequency problems, and Pandas provides a number of techniques for solving them.
Suppose you have a time series that contains data compiled weekly (on Wednesdays):
In [ ]:
ts1 = Series(np.random.randn(3),
index=pd.date_range('2012-6-13', periods=3, freq='W-WED'))
ts1
In certain circumstances, it might be useful to resample the data to different frequencies. For example, if you need to resample the data so that an entry is provided for every business day in the period, you can employ:
In [ ]:
ts1.resample('B')
Notice here that, because no new information is provided, all of the new dates are simply left as NaN. If you want to fill these gaps with previous data, you can apply various fill strategies via the fill_method parameter.
In [ ]:
ts1.resample('B', fill_method='ffill')
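For comparison, here is a rough sketch of the same upsampling expressed with reindex over an explicit business-day range, which should give a similar result:
In [ ]:
# Build a business-day index covering the series and forward-fill onto it
bdays = pd.date_range(ts1.index[0], ts1.index[-1], freq='B')
ts1.reindex(bdays, method='ffill')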
This is an elegant solution for upsampling from lower-frequency to higher-frequency data, but another class of frequency problems involves irregular time series data, for which the above methods will not work as neatly. Consider the following data set containing irregularly sampled data:
In [ ]:
dates = pd.DatetimeIndex(['2012-6-12', '2012-6-17', '2012-6-18',
'2012-6-21', '2012-6-22', '2012-6-29'])
ts2 = Series(np.random.randn(6), index=dates)
ts2
If you want to add the "as of" values in ts1 to ts2, one option would be to resample both to a regular frequency and then add, but if you want to maintain the date index in ts2, using reindex is a more precise solution:
In [ ]:
ts1.reindex(ts2.index, method='ffill')
In [ ]:
ts2 + ts1.reindex(ts2.index, method='ffill')
Periods, as opposed to timestamps, are another common approach to organizing time series data, and they come with their own kind of time series frequency problems. Suppose you have the following GDP and inflation data, which are both periodic but at different frequencies:
In [ ]:
gdp = Series([1.78, 1.94, 2.08, 2.01, 2.15, 2.31, 2.46],
index=pd.period_range('1984Q2', periods=7, freq='Q-SEP'))
infl = Series([0.025, 0.045, 0.037, 0.04],
index=pd.period_range('1982', periods=4, freq='A-DEC'))
gdp
In [ ]:
infl
In Pandas, unlike with timestamps, operations between period-indexed time series of different frequencies are not possible without explicit conversions. In this case, if we know that the infl values were observed at the end of each year, we can convert them to Q-SEP to get the right periods in that frequency:
In [ ]:
infl_q = infl.asfreq('Q-SEP', how='end')
infl_q
Then, we can simply reindex the inflation time series data with a forward-filling method to match the GDP data.
In [ ]:
infl_q.reindex(gdp.index, method='ffill')
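Once both series share the same Q-SEP PeriodIndex, arithmetic between them is well-defined again. A purely illustrative sketch (the particular combination has no economic meaning; the point is only that the operation now aligns):
In [ ]:
# Align inflation to the GDP periods, then combine freely
infl_aligned = infl_q.reindex(gdp.index, method='ffill')
gdp + infl_aligned  # works because both share the same Q-SEP PeriodIndex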
Suppose you have a long time series containing intraday market data and you want to extract the prices at a particular time of day on each day of the data. What if the data are irregular such that observations do not fall exactly on the desired time? In practice this task can make for error-prone data munging if you are not careful. Here is an example for illustration purposes:
In [ ]:
# Make an intraday date range and time series
rng = pd.date_range('2012-06-01 09:30', '2012-06-01 15:59', freq='T')
# Extend to 4 consecutive business days of 9:30-15:59 values
rng = rng.append([rng + pd.offsets.BDay(i) for i in range(1, 4)])
ts = Series(np.arange(len(rng), dtype=float), index=rng)
ts
We can index ts with a datetime.time object to extract the values at 10:00 AM on each day in the series.
In [ ]:
from datetime import time
ts[time(10, 0)]
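Equivalently, assuming your pandas version provides the at_time method (it performs the same time-of-day selection):
In [ ]:
# Select all observations falling exactly at 10:00 AM
ts.at_time(time(10, 0))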
Alternatively, you can select all timestamps between two times of day, such as 10:00 AM and 10:01 AM (inclusive).
In [ ]:
ts.between_time(time(10, 0), time(10, 1))
In [ ]:
np.random.seed(12346)
However, it could be the case that no data actually fell exactly at 10:00 AM, but you might want to know the last known value at 10:00. In that case, you might do something like the following:
In [ ]:
# Set most of the time series randomly to NA
indexer = np.sort(np.random.permutation(len(ts))[700:])
irr_ts = ts.copy()
irr_ts[indexer] = np.nan
irr_ts['2012-06-01 09:50':'2012-06-01 10:00']
In [ ]:
selection = pd.date_range('2012-06-01 10:00', periods=4, freq='B')
irr_ts.asof(selection)
The asof method performs this for you: it returns the last valid (non-NA) value at or before each requested timestamp.
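A rough hand-rolled equivalent (a sketch that works here only because the requested 10:00 timestamps are present in the minute-level index) is to forward-fill and then select:
In [ ]:
# Forward-fill the gaps, then pull out the requested timestamps
irr_ts.fillna(method='ffill')[selection]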
In financial and economic contexts, there are a number of widely occurring use cases for merging related datasets: switching from one source of time series data to another at a specific point in time; "patching" missing values in one time series using values from another; and completely replacing the data for a subset of symbols (columns).
In the first case, switching from one set of time series to another at a specific instant, it is a matter of splicing together two TimeSeries or DataFrame objects using pandas.concat:
In [ ]:
data1 = DataFrame(np.ones((6, 3), dtype=float),
columns=['a', 'b', 'c'],
index=pd.date_range('6/12/2012', periods=6))
data2 = DataFrame(np.ones((6, 3), dtype=float) * 2,
columns=['a', 'b', 'c'],
index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.ix[:'2012-06-14'], data2.ix['2012-06-15':]])
spliced
Suppose, in a similar example, that data1 was missing a time series present in data2:
In [ ]:
data2 = DataFrame(np.ones((6, 4), dtype=float) * 2,
columns=['a', 'b', 'c', 'd'],
index=pd.date_range('6/13/2012', periods=6))
spliced = pd.concat([data1.ix[:'2012-06-14'], data2.ix['2012-06-15':]])
spliced
Using combine_first, you can bring in data from before the splice point to extend the history for the d column:
In [ ]:
spliced_filled = spliced.combine_first(data2)
spliced_filled
Since there isn't any data for d on 2012-06-12 in data2, that entry remains NaN.
The update method on DataFrame objects performs in-place updates to the data. To fill in only the holes in the data, pass overwrite=False.
In [ ]:
spliced.update(data2, overwrite=False)
In [ ]:
spliced
You can also replace the data for any subset of symbols (columns). Below, the a and c columns are overwritten with the corresponding columns from data1:
In [ ]:
cp_spliced = spliced.copy()
cp_spliced[['a', 'c']] = data1[['a', 'c']]
cp_spliced
The return of a financial asset is defined as the cumulative percent change in its price.
Here is some price data for Apple, Inc.:
In [ ]:
import pandas.io.data as web
price = web.get_data_yahoo('AAPL', '2011-01-01')['Adj Close']
price[-5:]
At the time this example was written, Apple paid no dividends, which would otherwise have to be included in the calculation of net returns. Thus, a quick-and-dirty computation of returns will suffice:
In [ ]:
price['2011-10-03'] / price['2011-3-01'] - 1
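Equivalently, the per-period simple return is price_t / price_{t-1} - 1, which is exactly what pct_change computes; a quick sketch:
In [ ]:
# Same quantity as price.pct_change()
simple_rets = price / price.shift(1) - 1
simple_rets[-5:]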
Stocks with dividend payments complicate the computation because you have to factor in the stream of payments over time. The Adj Close column attempts to adjust for stock splits and dividends, but in any case it's quite common to derive a return index, which indicates the value of a one-dollar investment in the asset. For Apple, let's compute a simple return index using cumprod.
In [ ]:
returns = price.pct_change()
ret_index = (1 + returns).cumprod()
ret_index[0] = 1 # Set first value to 1
ret_index.plot(grid=True)
With a return index, we can manipulate the frequency at which we compute the returns quite easily.
In [ ]:
m_returns = ret_index.resample('BM', how='last').pct_change()
m_returns['2012']
Since no dividends or other adjustments are considered, we could alternatively have computed the monthly returns from the daily percent changes by resampling with a simple aggregation:
In [ ]:
m_rets = (1 + returns).resample('M', how='prod', kind='period') - 1
m_rets['2012']
Then, if you had the dividend payment data in hand, say in hypothetical dividend_dates and dividend_pcts variables holding the payment dates and the payment sizes as fractions of the price, you could include the payments by simply adding them to the daily returns:
returns[dividend_dates] += dividend_pcts
Let's consider a collection of made-up assets. We first generate a universe of 1000 tickers:
In [ ]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 10
np.random.seed(12345)
In [ ]:
import random; random.seed(0)
import string
N = 1000
def rands(n):
    choices = string.ascii_uppercase
    return ''.join([random.choice(choices) for _ in xrange(n)])
tickers = np.array([rands(5) for _ in xrange(N)])
Then, we can create a DataFrame containing three columns representing random portfolios for a subset of the above tickers:
In [ ]:
M = 500
df = DataFrame({'Momentum' : np.random.randn(M) / 200 + 0.03,
'Value' : np.random.randn(M) / 200 + 0.08,
'ShortInterest' : np.random.randn(M) / 200 - 0.02},
index=tickers[:M])
df.head()
We can aggregate these random tickers by industry. In this simple example, let's use two industries, financial and technology, and store the mapping as a Series object.
In [ ]:
ind_names = np.array(['FINANCIAL', 'TECH'])
sampler = np.random.randint(0, len(ind_names), N)
industries = Series(ind_names[sampler], index=tickers,
name='industry')
Using groupby mechanics, we can group the portfolio data by industry and carry out group aggregations and transformations:
In [ ]:
by_industry = df.groupby(industries)
by_industry.mean()
Of course, remember the handy describe method:
In [ ]:
by_industry.describe()
We can transform these portfolios along a particular industry by defining customized transformation functions. For example, standardizing within industry is widely used in equity portfolio research:
In [ ]:
# Within-Industry Standardize
def zscore(group):
    return (group - group.mean()) / group.std()
df_stand = by_industry.apply(zscore)
You can verify that each industry has mean (very nearly) 0 and standard deviation 1:
In [ ]:
df_stand.groupby(industries).agg(['mean', 'std'])
Other built-in kinds of transforms, such as rank, can be used to make the analysis more concise.
In [ ]:
# Within-industry rank descending
ind_rank = by_industry.rank(ascending=False)
ind_rank.groupby(industries).agg(['min', 'max'])
In quantitative equity, "rank and standardize" is a common sequence of transformations. You can compose these operations concisely with a well-placed lambda, as follows:
In [ ]:
# Industry rank and standardize
by_industry.apply(lambda x: zscore(x.rank())).head()
Quantitative portfolio management takes heavy advantage of factor analysis. From Wikipedia:
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in four observed variables mainly reflect the variations in two unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables.
Portfolio holdings and performance are decomposed using one or more factors represented as a portfolio of weights. A common example is a stock's beta, which measures co-movement between a stock and a benchmark (like the S&P 500). We can consider a contrived example of a portfolio constructed from three randomly-generated factors (usually called factor loadings) and some weights:
In [ ]:
from numpy.random import rand
fac1, fac2, fac3 = np.random.rand(3, 1000)
ticker_subset = tickers.take(np.random.permutation(N)[:1000])
# Weighted sum of factors plus noise
port = Series(0.7 * fac1 - 1.2 * fac2 + 0.3 * fac3 + rand(1000),
index=ticker_subset)
factors = DataFrame({'f1': fac1, 'f2': fac2, 'f3': fac3},
index=ticker_subset)
Pairwise correlations between each factor and the portfolio may not reveal much on their own:
In [ ]:
factors.corrwith(port)
The standard method for computing factor exposures is least squares regression. You can do so with a number of Python libraries, from SciPy and NumPy to more advanced libraries such as statsmodels. However, Pandas makes the process particularly easy with its pandas.ols function.
In [ ]:
pd.ols(y=port, x=factors).beta
Compare these with the original factor weights specified above, and you will see that this regression performed considerably better than the corrwith results; we have almost completely recovered the weights. With groupby you can compute exposures industry by industry. To do so, wrap the regression in a new function, such as:
In [ ]:
def beta_exposure(chunk, factors=None):
    return pd.ols(y=chunk, x=factors).beta
Then, simply group port by industries and apply the function, passing the DataFrame of factor loadings:
In [ ]:
by_ind = port.groupby(industries)
exposures = by_ind.apply(beta_exposure, factors=factors)
exposures.unstack()
In many circumstances it is useful to break down data based on sample quantiles. For example, the performance of a stock portfolio could be separated into quartiles based on each stock's price-to-earnings ratio. With Pandas, the pandas.qcut function combined with groupby makes this a straightforward task.
Consider a simple trend following or momentum strategy trading the S&P 500 index via the SPY exchange-traded fund.
In [ ]:
import pandas.io.data as web
data = web.get_data_yahoo('SPY', '2006-01-01')
data.info()
We compute the daily returns and define a function that transforms the returns into a trend signal formed by a lagged moving sum:
In [ ]:
px = data['Adj Close']
returns = px.pct_change()
def to_index(rets):
    index = (1 + rets).cumprod()
    # Set the value just before the first valid observation to 1
    first_loc = max(index.index.get_loc(index.first_valid_index()) - 1, 0)
    index.values[first_loc] = 1
    return index
def trend_signal(rets, lookback, lag):
    signal = pd.rolling_sum(rets, lookback, min_periods=lookback - 5)
    return signal.shift(lag)
Using this function, we can create and test a trading strategy that trades this momentum signal every Friday:
In [ ]:
signal = trend_signal(returns, 100, 3)
trade_friday = signal.resample('W-FRI').resample('B', fill_method='ffill')
trade_rets = trade_friday.shift(1) * returns
trade_rets = trade_rets[:len(returns)]
We can then convert the strategy returns to a return index and plot them:
In [ ]:
to_index(trade_rets).plot(grid=True, figsize=(12,6))
Caveat: this is a naive strategy!
Suppose that you want to decompose the strategy performance into more and less volatile periods of trading. Trailing one-year annualized standard deviation is a simple measure of volatility, and we can compute Sharpe ratios to assess the reward-to-risk ratio in various volatility regimes:
In [ ]:
vol = pd.rolling_std(returns, 250, min_periods=200) * np.sqrt(250)
def sharpe(rets, ann=250):
    return rets.mean() / rets.std() * np.sqrt(ann)
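Before splitting performance into regimes, it can be worth a quick look at the strategy-wide annualized Sharpe ratio (a small sketch using the function just defined):
In [ ]:
# Whole-sample annualized Sharpe ratio of the strategy
sharpe(trade_rets)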
Now we can divide vol into quartiles with pd.qcut and aggregate with sharpe:
In [ ]:
cats = pd.qcut(vol, 4)
print('cats: %d, trade_rets: %d, vol: %d' % (len(cats), len(trade_rets), len(vol)))
In [ ]:
trade_rets.groupby(cats).agg(sharpe)
These results show that the strategy performed best when volatility was highest.
From Python for Data Analysis:
In practice, modeling and trading futures contracts on equities, currencies, commodities, bonds, and other asset classes is complicated by the time-limited nature of each contract. For example, at any given time for a type of future (say silver or copper futures) multiple contracts with different expiration dates may be traded. In many cases, the future contract expiring next (the near contract) will be the most liquid (highest volume and lowest bid-ask spread).
For the purposes of modeling and forecasting, it can be much easier to work with a continuous return index indicating the profit and loss associated with always holding the near contract. Transitioning from an expiring contract to the next (or far) contract is referred to as rolling. Computing a continuous future series from the individual contract data is not necessarily a straightforward exercise and typically requires a deeper understanding of the market and how the instruments are traded. For example, in practice when and how quickly would you trade out of an expiring contract and into the next contract?
We will go through one way to do this. Using scaled prices for the SPY exchange-traded fund as a proxy for the S&P 500, we have:
In [ ]:
pd.options.display.max_rows = 10
import pandas.io.data as web
# Approximate price of S&P 500 index
px = web.get_data_yahoo('SPY')['Adj Close'] * 10
px
Now, a little bit of preamble. Let's put a couple of S&P 500 futures contracts and their expiry dates in a Series object.
In [ ]:
from datetime import datetime
expiry = {'ESU2': datetime(2012, 9, 21),
'ESZ2': datetime(2012, 12, 21)}
expiry = Series(expiry).order()
expiry
Using Yahoo! Finance prices and a random walk with noise, we can simulate the two contracts into the future.
In [ ]:
np.random.seed(12347)
N = 200
walk = (np.random.randint(0, 200, size=N) - 100) * 0.25
perturb = (np.random.randint(0, 20, size=N) - 10) * 0.25
walk = walk.cumsum()
rng = pd.date_range(px.index[0], periods=len(px) + N, freq='B')
near = np.concatenate([px.values, px.values[-1] + walk])
far = np.concatenate([px.values, px.values[-1] + walk + perturb])
prices = DataFrame({'ESU2': near, 'ESZ2': far}, index=rng)
prices then contains one time series for each contract, the two differing from one another by a random amount:
In [ ]:
prices.tail()
The technique we will use to splice these two separate time series into a single continuous series is to construct a weighting matrix. Active contracts get a weight of 1 until the expiry date approaches, at which point we decide on a roll convention. Here is a function that computes a weighting matrix with linear decay over a number of periods prior to expiry:
In [ ]:
def get_roll_weights(start, expiry, items, roll_periods=5):
    # start : first date to compute weighting DataFrame
    # expiry : Series of ticker -> expiration dates
    # items : sequence of contract names
    dates = pd.date_range(start, expiry[-1], freq='B')
    weights = DataFrame(np.zeros((len(dates), len(items))),
                        index=dates, columns=items)
    prev_date = weights.index[0]
    for i, (item, ex_date) in enumerate(expiry.iteritems()):
        if i < len(expiry) - 1:
            weights.ix[prev_date:ex_date - pd.offsets.BDay(), item] = 1
            roll_rng = pd.date_range(end=ex_date - pd.offsets.BDay(),
                                     periods=roll_periods + 1, freq='B')
            decay_weights = np.linspace(0, 1, roll_periods + 1)
            weights.ix[roll_rng, item] = 1 - decay_weights
            weights.ix[roll_rng, expiry.index[i + 1]] = decay_weights
        else:
            weights.ix[prev_date:, item] = 1
        prev_date = ex_date
    return weights
The weights look like this around the ESU2 expiry:
In [ ]:
weights = get_roll_weights('6/1/2012', expiry, prices.columns)
weights.ix['2012-09-12':'2012-09-21']
Finally, the rolled future returns are just a weighted sum of the contract returns:
In [ ]:
rolled_returns = (prices.pct_change() * weights).sum(1)
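To see the effect of the roll (a quick inspection, reusing the expiry window shown above), you can slice the rolled return series around the ESU2 expiry:
In [ ]:
# Blended returns during the roll into ESZ2
rolled_returns['2012-09-12':'2012-09-21']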
From Python for Data Analysis:
Dynamic models play an important role in financial modeling as they can be used to simulate trading decisions over a historical period. Moving window and exponentially-weighted time series functions are an example of tools that are used for dynamic models.
Correlation is one way to look at the co-movement between the changes in two asset time series. Pandas' rolling_corr function can be called with two return series to compute the moving-window correlation.
First, let's load some stocks from Yahoo! Finance and compute the daily returns.
In [ ]:
aapl = web.get_data_yahoo('AAPL', '2000-01-01')['Adj Close']
msft = web.get_data_yahoo('MSFT', '2000-01-01')['Adj Close']
aapl_rets = aapl.pct_change()
msft_rets = msft.pct_change()
Then, compute and plot the one-year moving correlation.
In [ ]:
plt.figure()
pd.rolling_corr(aapl_rets, msft_rets, 250).plot(grid=True, figsize=(12,6))
One shortcoming of correlation is that it does not capture differences in volatility between the two assets. Least-squares regression addresses this: OLS can model the dynamic relationship between a variable and one or more predictor variables.
In [ ]:
plt.figure()
model = pd.ols(y=aapl_rets, x={'MSFT': msft_rets}, window=250)
model.beta
In [ ]:
model.beta['MSFT'].plot(grid=True, figsize=(12,6))
There are of course more sophisticated techniques than OLS regression, which can be found in the statsmodels project, but this gives a flavor of the kind of analysis that is possible.